Claims



1. A noise reduction system with an audio-visual user interface, said system being specially
adapted for running an application for combining visual features (o_{v,nT}) extracted from a
digital video sequence (v(nT)) showing the face of a speaker (S_i) with audio features (o_{a,nT})
extracted from an analog audio sequence (s(t)), wherein said audio sequence (s(t)) can include
noise in the environment of said speaker (S_i), said noise reduction system (200b/c)
comprising

- means (101a, 106b) for detecting and analyzing said analog audio sequence (s(t)),

- means (101b') for detecting said video sequence (v(nT)), and

- means (104a+b, 104'+104") for analyzing the detected video signal (v(nT)),
characterized by

a noise reduction circuit (106) being adapted to separate the speaker's voice from said
background noise (n'(t)) based on a combination of derived speech characteristics
(o_{av,nT} := [o_{a,nT}^T, o_{v,nT}^T]^T) and outputting a speech activity indication signal
(ŝ_i(nT)) which is obtained by a combination of speech activity estimates supplied by said
analyzing means (106b, 104a+b, 104'+104").
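
Illustrative note (not part of the claim): claim 1 leaves open how the audio and visual speech activity estimates are combined. The sketch below assumes one simple possibility, a per-frame weighted average of two estimates that each lie in [0, 1]. The function name `combine_activity` and the parameter `audio_weight` are hypothetical and do not appear in the source.

```python
def combine_activity(audio_est, visual_est, audio_weight=0.5):
    """Blend per-frame audio and visual speech activity estimates.

    Assumption: both inputs are sequences of values in [0, 1], one per frame.
    The weighted average is only one of many possible combinations.
    """
    if len(audio_est) != len(visual_est):
        raise ValueError("both estimates must cover the same frames")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [audio_weight * a + visual_weight * v
            for a, v in zip(audio_est, visual_est)]
```

With `audio_weight=0.5` both modalities count equally; a noisy environment might justify a lower audio weight, which is a design choice rather than anything the claim prescribes.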

2. A noise reduction system according to claim 1, 
characterized by 

means (SW) for switching off an audio channel in case the actual level of said speech
activity indication signal (ŝ_i(nT)) falls below a predefined threshold value.
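
Illustrative note (not part of the claim): the switching behaviour of claim 2 can be sketched as a gate that mutes every audio frame whose speech activity level is below a threshold. The names `gate_audio` and `threshold`, and the representation of frames as lists of samples, are assumptions made for this sketch.

```python
def gate_audio(frames, activity, threshold=0.5):
    """Mute each frame whose speech activity level falls below the threshold.

    frames   -- list of frames, each a list of audio samples
    activity -- one speech activity level per frame (hypothetical scale 0..1)
    """
    gated = []
    for frame, level in zip(frames, activity):
        if level < threshold:
            gated.append([0.0] * len(frame))  # channel switched off
        else:
            gated.append(list(frame))         # channel passed through
    return gated
```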

3. A noise reduction system according to any one of claims 1 or 2,
characterized by

a multi-channel acoustic echo cancellation unit (108) being specially adapted to perform a
near-end speaker detection and double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio feature extraction and analyzing means (106b)
and said visual feature extraction and analyzing means (104a+b, 104'+104").

4. A noise reduction system according to any one of claims 1 to 3,
characterized in that 

said audio feature extraction and analyzing means (106b) is an amplitude detector. 
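
Illustrative note (not part of the claim): an amplitude detector as named in claim 4 could, for example, compare the root-mean-square level of each audio frame with a threshold. The frame length, the threshold and the function name `amplitude_activity` are assumptions of this sketch.

```python
import math


def amplitude_activity(samples, frame_len, threshold):
    """Return 1.0 for frames whose RMS amplitude reaches the threshold, else 0.0.

    Only complete frames are evaluated; trailing samples are ignored.
    """
    estimates = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        estimates.append(1.0 if rms >= threshold else 0.0)
    return estimates
```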

5. A near-end speaker detection method reducing the noise level of a detected analog audio
sequence (s(t)),

said method being characterized by the following steps:

- subjecting (S1) said analog audio sequence (s(t)) to an analog-to-digital conversion,

- calculating (S2) the corresponding discrete signal spectrum (S(k·Δf)) of the
  analog-to-digital-converted audio sequence (s(nT)) by performing a Fast Fourier Transform (FFT),

- detecting (S3) the voice of said speaker (S_i) from said signal spectrum (S(k·Δf)) by
  analyzing visual features (o_{v,nT}) extracted from a video sequence (v(nT)) which is
  recorded simultaneously with the analog audio sequence (s(t)) and tracks the current
  location of the speaker's face, lip movements and/or facial expressions of the speaker (S_i)
  in subsequent images,

- estimating (S4) the noise power density spectrum (Φ_nn(f)) of the statistically distributed
  background noise (n'(t)) based on the result of the speaker detection step (S3),

- subtracting (S5) a discretized version (Φ̂_nn(k·Δf)) of the estimated noise power density
  spectrum (Φ̂_nn(f)) from the discrete signal spectrum (S(k·Δf)) of the
  analog-to-digital-converted audio sequence (s(nT)), and

- calculating (S6) the corresponding discrete time-domain signal (ŝ_i(nT)) of the obtained
  difference signal by performing an Inverse Fast Fourier Transform (IFFT), thereby
  yielding a discrete version of the recognized speech signal.
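
Illustrative note (not part of the claim): steps S2 to S6 resemble classical spectral subtraction. The sketch below is written under several assumptions that the claim does not state: the audio is already digitized and split into frames, the result of the detection step S3 is supplied as one boolean per frame, the noise estimate is the mean power spectrum of the frames without speech, the subtraction is done on the power spectrum with a floor at zero, and the noisy phase is reused. A plain DFT stands in for the FFT to keep the example dependency-free. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform (stand-in for the FFT of step S2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, real part only (stand-in for the IFFT of step S6)."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(frames, speech_flags):
    """Subtract a noise power estimate from every frame's spectrum.

    frames       -- equally long lists of samples (digitized audio, after S1)
    speech_flags -- per-frame result of the detection step S3 (True = speech)
    """
    spectra = [dft(frame) for frame in frames]                       # S2
    noise_spectra = [s for s, is_speech in zip(spectra, speech_flags)
                     if not is_speech]
    bins = len(spectra[0])
    if noise_spectra:                                                # S4
        noise_psd = [sum(abs(s[k]) ** 2 for s in noise_spectra) / len(noise_spectra)
                     for k in range(bins)]
    else:
        noise_psd = [0.0] * bins
    cleaned = []
    for spec in spectra:
        reduced = []
        for k, value in enumerate(spec):
            power = max(abs(value) ** 2 - noise_psd[k], 0.0)         # S5
            reduced.append(cmath.rect(math.sqrt(power), cmath.phase(value)))
        cleaned.append(idft(reduced))                                # S6
    return cleaned
```

Flooring the subtracted power at zero avoids negative power values; practical systems often use a small positive floor instead to limit musical noise, which is outside the scope of this sketch.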

6. A near-end speaker detection method according to claim 5, 
characterized by the step of 

conducting (S7) a multi-channel acoustic echo cancellation algorithm which models echo
path impulse responses by means of adaptive finite impulse response (FIR) filters and
subtracts echo signals from the analog audio sequence (s(t)) based on acoustic-phonetic speech
characteristics derived by an algorithm for extracting visual features (o_{v,nT}) from a video
sequence (v(nT)) tracking the location of a speaker's face, lip movements and/or facial
expressions of the speaker (S_i) in subsequent images.

7. A near-end speaker detection method according to claim 6, 
characterized in that 

said multi-channel acoustic echo cancellation algorithm performs a double-talk detection 
procedure. 
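
Illustrative note (not part of the claims): claims 6 and 7 name adaptive FIR filters and a double-talk detection procedure but no particular adaptation rule. The sketch below assumes a single-channel normalized LMS (NLMS) filter whose adaptation is frozen whenever a near-end speaker is flagged as active, for instance by a visual speech activity estimate. The function name, the parameters and the reduction to one channel are assumptions of this sketch.

```python
def nlms_echo_canceller(far_end, mic, num_taps, near_end_active=None,
                        step=0.5, eps=1e-8):
    """Return (echo-reduced signal, final FIR weights).

    far_end         -- loudspeaker samples (echo source)
    mic             -- microphone samples containing the echo
    near_end_active -- optional per-sample flags; True freezes adaptation
                       (a simple stand-in for double-talk handling)
    """
    weights = [0.0] * num_taps   # adaptive FIR model of the echo path
    history = [0.0] * num_taps   # most recent far-end samples, newest first
    cleaned = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # microphone minus modelled echo
        cleaned.append(error)
        double_talk = near_end_active is not None and near_end_active[n]
        if not double_talk:
            energy = sum(h * h for h in history) + eps
            weights = [w + step * error * h / energy
                       for w, h in zip(weights, history)]
    return cleaned, weights
```

Freezing the update during double talk keeps the near-end voice from being mistaken for a change in the echo path, which is the usual motivation for a double-talk detector.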

8. A near-end speaker detection method according to any one of claims 5 to 7,
characterized in that

said acoustic-phonetic speech characteristics are based on the opening of a speaker's mouth
as an estimate of the acoustic energy of articulated vowels or diphthongs, respectively,
rapid movement of the speaker's lips as a hint to labial or labio-dental consonants,
respectively, and other statistically detected phonetic characteristics of an association between
position and movement of the lips and the voice and pronunciation of said speaker (S_i).
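
Illustrative note (not part of the claim): the two visual cues named in claim 8 can be sketched from a per-frame mouth opening measurement. The normalized opening value, the velocity threshold and the function name `visual_speech_cues` are hypothetical; how such a measurement is obtained from the video is not covered here.

```python
def visual_speech_cues(mouth_opening, velocity_threshold=0.3):
    """Derive simple per-frame cues from a mouth opening track.

    mouth_opening -- one normalized opening value per video frame (assumed 0..1)
    Returns a list of dicts with:
      vowel_energy   -- the opening itself, used as a proxy for vowel energy
      consonant_hint -- True when the opening changes faster than the threshold
    """
    cues = []
    previous = mouth_opening[0] if mouth_opening else 0.0
    for opening in mouth_opening:
        speed = abs(opening - previous)      # frame-to-frame lip movement
        cues.append({"vowel_energy": opening,
                     "consonant_hint": speed > velocity_threshold})
        previous = opening
    return cues
```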

9. A near-end speaker detection method according to any one of claims 5 to 8,
characterized by

a learning procedure used for enhancing the step of detecting (S3) the voice of said speaker
(S_i) from the discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted version
(s(nT)) of an analog audio sequence (s(t)) by analyzing visual features (o_{v,nT}) extracted
from a video sequence (v(nT)) which is recorded simultaneously with the analog audio
sequence (s(t)) and tracks the current location of the speaker's face, lip movements
and/or facial expressions of the speaker (S_i) in subsequent images.

10. A near-end speaker detection method according to any one of claims 5 to 9,
characterized by the step of

correlating (S8a) the discrete signal spectrum (S_τ(k·Δf)) of a delayed version (s(nT-τ)) of the
analog-to-digital-converted audio signal (s(nT)) with an audio speech activity estimate
obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum
(S(k·Δf)), thereby yielding an estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f))
corresponding to the signal (s_i(t)) which represents said speaker's voice as well as an estimate
(Φ̂_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically distributed
background noise (n'(t)).

11. A near-end speaker detection method according to claim 10, 
characterized by the step of 

correlating (S9) the discrete signal spectrum (S_τ(k·Δf)) of a delayed version (s(nT-τ)) of the
analog-to-digital-converted audio signal (s(nT)) with a visual speech activity estimate taken
from a visual feature vector (o_{v,t}) supplied by the visual feature extraction and analyzing
means (104a+b, 104'+104"), thereby yielding a further estimate (Ŝ_i'(f)) for updating the
estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f)) corresponding to the signal (s_i(t))
which represents said speaker's voice as well as a further estimate (Φ̂_nn'(f)) for updating
the estimate (Φ̂_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically
distributed background noise (n'(t)).

12. A near-end speaker detection method according to any one of claims 10 or 11,
characterized by the step of

adjusting (S10) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t))
dependent on the bandwidth of the estimated speech signal spectrum (Ŝ_i(f)).
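
Illustrative note (not part of the claim): one way to make band-pass cut-off frequencies follow the bandwidth of an estimated speech spectrum is to keep the frequency range in which the estimated power stays above a fraction of its peak. The relative threshold and the function names `speech_band` and `band_pass` are assumptions of this sketch, and the filter is an idealized brick-wall mask on the spectrum bins.

```python
def speech_band(freqs_hz, speech_power, rel_threshold=0.1):
    """Return (low, high) cut-offs covering bins above rel_threshold * peak.

    Returns None when the estimated spectrum carries no power.
    """
    if not freqs_hz or len(freqs_hz) != len(speech_power):
        raise ValueError("freqs_hz and speech_power must be non-empty and equally long")
    peak = max(speech_power)
    if peak <= 0.0:
        return None
    occupied = [f for f, p in zip(freqs_hz, speech_power)
                if p >= rel_threshold * peak]
    return min(occupied), max(occupied)


def band_pass(spectrum, freqs_hz, low_hz, high_hz):
    """Zero every spectrum bin outside the pass band [low_hz, high_hz]."""
    return [value if low_hz <= f <= high_hz else 0.0
            for value, f in zip(spectrum, freqs_hz)]
```

Calling `speech_band` on each new spectrum estimate and feeding the result to `band_pass` gives a filter whose pass band tracks the speaker's current bandwidth.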

13. A near-end speaker detection method according to any one of claims 5 to 9,
characterized by the steps of

- adding (S11a) an audio speech activity estimate obtained by an amplitude detection of
  the band-pass-filtered discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted
  audio signal (s(t)) to a visual speech activity estimate taken from a visual
  feature vector (o_{v,t}) supplied by said visual feature extraction and analyzing means
  (104a+b, 104'+104"), thereby yielding an audio-visual speech activity estimate,

- correlating (S11b) the discrete signal spectrum (S(k·Δf)) with the audio-visual speech
  activity estimate, thereby yielding an estimate (Ŝ_i(f)) for the frequency spectrum (S_i(f))
  corresponding to the signal (s_i(t)) which represents said speaker's voice as well as an
  estimate (Φ̂_nn(f)) for the noise power density spectrum (Φ_nn(f)) of the statistically
  distributed background noise (n'(t)), and

- adjusting (S11c) the cut-off frequencies of a band-pass filter (204) used for filtering the
  discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t))
  dependent on the bandwidth of the estimated speech signal spectrum (Ŝ_i(f)).

14. Use of a noise reduction system (200b/c) according to any one of claims 1 to 4 and a
near-end speaker detection method according to any one of claims 5 to 13 for a
video-telephony based application in a telecommunication system running on a video-enabled
phone with a built-in video camera (101b') pointing at the face of a speaker (S_i)
participating in a video telephony session.



15. A telecommunication device equipped with an audio-visual user interface, 
characterized by 

a noise reduction system (200b/c) according to any one of claims 1 to 4.



